The goal of this notebook is to identify other clusters of mutations
with same approach used to define the mutations in Genome-1
and Genome-2. Hopefully, these can be used to establish
subclonal haplotypes on the background of Genome-1 and
Genome-2.
The inputs for this analysis are (1) the filtered and combined
variants annotated by major genotypes from
identify-backgrounds.Rmd and (2) annotations of the
positions of the main Measles open reading frames.
## ==== File paths input data ==== ##
# Data from lofreq and varscan labeled with G1 and G2 mutations
# labeled.data = "../../results/variants/genotyped_variants.csv"
labeled.data = snakemake@params[['incsv']]
# Annotations
# annotations.filepath = "../../config/annotations.csv"
annotations.filepath = snakemake@params[['annotations']]
The output of this notebook are the filtered variants further annotated with clusters of mutations that make up probable subclonal haplotypes.
## ==== File paths output data ==== ##
# Path to save the results
output.path = "../../results/variants/clustered_variants.csv"
output.path = snakemake@params[['outcsv']]
Since the goal of this notebook is use the same technique that I used
to identify Genome-1, Genome-2, and
Genome-1-1, but for mutations that aren’t necessarily found
in every single tissue, I need to account for missing mutations. To do
this, I spread the data frame of frequencies and assign an allele
frequency of 0 if a variant is missing from a tissue. This will allow us
to cluster all of the mutations.
# Read in the variants with G1, G2, and G1-1 labeled
labeled.df = read_csv(labeled.data, show_col_types = FALSE)
haplotypes.label = labeled.df %>%
select(SNP, Haplotype, Background) %>%
distinct()
# Expand the mutations to have a frequency for every tissue.
expanded.df = labeled.df %>%
select(SNP, Tissue, AF) %>%
pivot_wider(names_from = "Tissue", values_from = "AF", values_fill = 0) %>%
pivot_longer(cols = !SNP, names_to = "Tissue", values_to = "AF") %>%
left_join(., select(labeled.df, c("SNP", "Tissue", "DP")), by = c("SNP", "Tissue")) %>%
mutate(DP = if_else(is.na(DP), 0, DP)) %>%
left_join(., haplotypes.label, by = "SNP")
# Get the mean frequency of the major genomes
tissue.mean = labeled.df %>%
filter(Haplotype %in% c("genome-1", "genome-2")) %>%
group_by(Tissue, Haplotype) %>%
summarize(AF.mean = mean(AF, na.rm = TRUE),
SD = sd(AF, na.rm = TRUE),
N = n()) %>%
mutate(SE = SD / sqrt(N),
Lower.CI = qt(1 - (0.05 / 2), N - 1) * SE,
Upper.CI = qt(1 - (0.05 / 2), N - 1) * SE) %>%
rename("AF" = AF.mean)
# Tissue order - roughly by the location in the brain
tissue_order = c(
"SSPE 1",
"SSPE 2",
"Frontal Cortex 2",
"Frontal Cortex 1",
"Frontal Cortex 3",
"Parietal Lobe",
"Temporal Lobe",
"Occipital Lobe",
"Hippocampus",
"Internal Capsule",
"Midbrain",
"UBS",
"Brain Stem",
"Cerebellum",
"Cerebellum Nucleus"
)
I’ll write the approach I used in
identify-backgrounds.Rmd into a reusable function. I can
provide a list of SNPs, the full data frame, and the number of clusters
I’m targeting.
cluster.snps <- function(list.of.snps, snp.df, n.clusters) {
# SNP as column and Allele frequency as row
frequency.by.snp = snp.df %>%
filter(SNP %in% list.of.snps) %>%
select(AF, SNP, Tissue) %>%
pivot_wider(names_from = SNP, values_from = AF, values_fill = 0) %>%
select(!Tissue)
# Calculate R between every pair of columns while handling NAs
snp.correlation = cor(frequency.by.snp, use="pairwise.complete.obs")
# Convert to distance (positive corr is close to 0)
snp.dist = as.dist(1 - snp.correlation)
# Create k-medoids clustering with n clusters
snp.kmedoids = pam(snp.dist, n.clusters)
kclusters = snp.kmedoids$cluster
# Convert to a data frame
kclusters.df = data.frame(SNP = names(kclusters), cluster = kclusters)
# Assign the clusters to the original data frame
kmedoids.SNPs = snp.df %>%
filter(SNP %in% list.of.snps) %>%
left_join(., kclusters.df, by = "SNP") %>%
mutate(cluster = if_else(is.na(cluster), "no cluster", as.character(cluster)))
n.clusters.per.snp = kmedoids.SNPs %>%
select(SNP, cluster) %>%
distinct() %>%
group_by(cluster) %>%
count() %>%
mutate(cluster_size = paste0(cluster, " (", n, " snps in cluster)"))
return(left_join(kmedoids.SNPs, n.clusters.per.snp, by = "cluster"))
}
Although this approach works reasonably well, it’s very susceptible to noise. This means that you can’t just provide all the variants at once and expect the algorithm to partition them into correct clusters. Instead, the best approach I found was to break variants down into bins based on their frequency and cluster these separately. The higher average frequency, the less noise, and the better the algorithm is at partitioning the variants into meaningful clusters.
First, I’ll start with a ‘high-frequency’ bin. This includes all variants that reach greater than or equal to 25% in at least one tissue.
# SNPs that are above 25% at some point in some tissue
twentyfive.percent.or.more.snps = expanded.df %>%
filter(Haplotype == "subclonal") %>% # Not including the mutations we already haplotyped
filter(AF >= 0.25) %>%
pull(SNP) %>%
unique(.)
# Cluster the SNPs
twentyfive.percent.or.more.clusters.df = cluster.snps(list.of.snps = twentyfive.percent.or.more.snps,
snp.df = expanded.df,
n.clusters = 20)
# Mean of each tissue ~ for plotting
twentyfive.percent.or.more.clusters.mean = twentyfive.percent.or.more.clusters.df %>%
group_by(Tissue, cluster, cluster_size, n) %>%
summarize(AF = mean(AF))
## `summarise()` has grouped output by 'Tissue', 'cluster', 'cluster_size'. You
## can override using the `.groups` argument.
# Plot the clusters
twentyfive.percent.or.more.clusters.df %>%
ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
geom_line(aes(group = SNP)) +
geom_line(data = twentyfive.percent.or.more.clusters.mean, aes(x = Tissue, y = AF, group = 1, col = (cluster)), size = 1) +
geom_ribbon(data = tissue.mean, aes(x=Tissue, ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
facet_wrap(~cluster_size) +
scale_fill_manual(values=c("#424ef5", "#cf1919")) +
xlab("Tissue") +
ylab("Allele Frequency") +
labs(fill="Haplotype", col = "Cluster") +
theme_bw(20) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(strip.text.x = element_text(size = 12))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Some clusters have only a single mutation in them. Clearly, some of
these belong to Genome-1, Genome-2, and
Genome-1-1, but are missing from a single tissue, perhaps
due to low coverage in that particular tissue.
high.frequency.not.clustered.snps = twentyfive.percent.or.more.clusters.df %>%
filter(n == 1) %>% # clusters that have only a single SNP in them
pull(SNP)
expanded.df %>%
filter(SNP %in% high.frequency.not.clustered.snps) %>%
left_join(., distinct(select(labeled.df, SNP, Gene_Name, AA_Change))) %>%
mutate(mutation.type = paste(Gene_Name, SNP, sep = "-")) %>%
ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
geom_line(aes(group = SNP)) +
geom_ribbon(data = tissue.mean, aes(x=Tissue, ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
facet_wrap(~mutation.type) +
scale_fill_manual(values=c("#424ef5", "#cf1919")) +
xlab("Tissue") +
ylab("Allele Frequency") +
labs(fill="Haplotype", col = "Cluster") +
theme_bw(20) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(strip.text.x = element_text(size = 12))
## Joining, by = "SNP"
Cluster 18 (L-A15084G) is a mutation in Genome-1-1 that
was too low frequency to be called in the Frontal Cortex 2
sample. Cluster 6 (M-C4502T) is mutation that’s part of
Genome-2 but was too low coverage in the
Cerebellum. Finally, Cluster 9 (M-C4573T) is a mutation in
Genome-1 that was also too low coverage in the
Cerebellum. Generally, the Cerebellum has the
lowest coverage of any tissue, so this isn’t surprising. All-in-all,
this adds three mutations to already existing haplotypes. I already
addressed F-C7036T, N-G810A, and H-T7293C in the
identify-backgrounds.Rmd notebook. Finally, M-C4532T might
also belong as part of the reference?
I’ll annotate these variants in the main data frame.
# Clusters that are clearly genome-1
genome.1.missing.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(9)) %>%
pull(SNP) %>%
unique(.)
# Clusters that are clearly genome-1-1
genome.1.1.missing.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(18)) %>%
pull(SNP) %>%
unique(.)
# Clusters that are clearly genome-2
genome.2.missing.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(6)) %>%
pull(SNP) %>%
unique(.)
# Clusters that break the rules
maybe.in.both.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(8, 13, 2, 15)) %>%
pull(SNP) %>%
unique(.)
# Annotate the labeled data frame with what's possible here
updated.haplotype.df = labeled.df %>%
mutate(Background = case_when(SNP %in% genome.1.missing.snps ~ "genome-1",
SNP %in% genome.1.1.missing.snps ~ "genome-1",
SNP %in% genome.2.missing.snps ~ "genome-2",
SNP %in% maybe.in.both.snps ~ "both",
TRUE ~ Background)) %>%
mutate(Haplotype = case_when(SNP %in% genome.1.missing.snps ~ "genome-1",
SNP %in% genome.1.1.missing.snps ~ "genome-1-1",
SNP %in% genome.2.missing.snps ~ "genome-2",
SNP %in% maybe.in.both.snps ~ "both",
TRUE ~ Haplotype))
Now, lets look at the remainder of the clusters of SNPs. These are SNPs that actually clustered with other variants.
twentyfive.percent.or.more.clusters.df %>%
filter(n > 1) %>%
ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
geom_line(data = filter(twentyfive.percent.or.more.clusters.df, n > 1),
aes(x = factor(Tissue, levels = tissue_order), y = AF, group = SNP, col = (cluster)), size = 1) +
geom_ribbon(data = tissue.mean, aes(x=factor(Tissue, levels = tissue_order), ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
facet_wrap(~cluster_size) +
scale_fill_manual(values=c("#424ef5", "#cf1919")) +
xlab("Tissue") +
ylab("Allele Frequency") +
labs(fill="Haplotype", col = "Cluster") +
theme_bw(20) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(strip.text.x = element_text(size = 12))
All of these look like reasonable clusters, so I’ll annotate the
variant data frame with these new clusters (subclonal haplotypes). I’ll
also try to guess if these are on the background of
Genome-1 or Genome-2 if it’s possible.
# Genome-1
cluster.1.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(1)) %>%
pull(SNP) %>%
unique(.)
# Genome-1
cluster.2.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(11)) %>%
pull(SNP) %>%
unique(.)
# Genome-1
cluster.3.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(12)) %>%
pull(SNP) %>%
unique(.)
# Genome-2
cluster.4.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(14)) %>%
pull(SNP) %>%
unique(.)
# Genome-2
cluster.5.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(16)) %>%
pull(SNP) %>%
unique(.)
# Genome-2
cluster.6.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(17)) %>%
pull(SNP) %>%
unique(.)
# Genome-2
cluster.7.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(19)) %>%
pull(SNP) %>%
unique(.)
# ?
cluster.8.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(20)) %>%
pull(SNP) %>%
unique(.)
# ?
cluster.9.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(3)) %>%
pull(SNP) %>%
unique(.)
# ?
cluster.10.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(5)) %>%
pull(SNP) %>%
unique(.)
# ?
cluster.11.snps = twentyfive.percent.or.more.clusters.df %>%
filter(cluster %in% c(7)) %>%
pull(SNP) %>%
unique(.)
# Annotate the labeled data frame with what's possible here
updated.haplotype.df = updated.haplotype.df %>%
mutate(Background = case_when(SNP %in% cluster.1.snps ~ "unknown",
SNP %in% cluster.2.snps ~ "unknown",
SNP %in% cluster.3.snps ~ "unknown",
SNP %in% cluster.4.snps ~ "unknown",
SNP %in% cluster.5.snps ~ "unknown",
SNP %in% cluster.6.snps ~ "unknown",
SNP %in% cluster.7.snps ~ "unknown",
SNP %in% cluster.8.snps ~ "unknown",
SNP %in% cluster.9.snps ~ "unknown",
SNP %in% cluster.10.snps ~ "unknown",
SNP %in% cluster.11.snps ~ "unknown",
TRUE ~ Background)) %>%
mutate(Haplotype = case_when(SNP %in% cluster.1.snps ~ "cluster 1",
SNP %in% cluster.2.snps ~"cluster 2",
SNP %in% cluster.3.snps ~"cluster 3",
SNP %in% cluster.4.snps ~ "cluster 4",
SNP %in% cluster.5.snps ~ "cluster 5",
SNP %in% cluster.6.snps ~ "cluster 6",
SNP %in% cluster.7.snps ~ "cluster 7",
SNP %in% cluster.8.snps ~ "cluster 8",
SNP %in% cluster.9.snps ~ "cluster 9",
SNP %in% cluster.10.snps ~ "cluster 10",
SNP %in% cluster.11.snps ~ "cluster 11",
TRUE ~ Haplotype))
Now, I’ll move onto the next bin. These are mutations that reach at least 5% in one tissue but weren’t part of the previous analysis. Most of these will be challenging to haplotype due to noise. I’ll only try to specify the clusters that seem the most clear cut.
twentyfive.percent.or.fewer.snps = expanded.df %>%
filter(!SNP %in% twentyfive.percent.or.more.snps) %>%
filter(Haplotype == "subclonal") %>%
filter(AF < 0.25 & AF >= 0.05) %>%
pull(SNP) %>%
unique(.)
# Cluster the SNPs
twentyfive.percent.or.fewer.clusters.df = cluster.snps(list.of.snps = twentyfive.percent.or.fewer.snps,
snp.df = expanded.df,
n.clusters = 40)
# Mean of each tissue
twentyfive.percent.or.fewer.clusters.mean = twentyfive.percent.or.fewer.clusters.df %>%
group_by(Tissue, cluster, cluster_size, n) %>%
summarize(AF = mean(AF))
## `summarise()` has grouped output by 'Tissue', 'cluster', 'cluster_size'. You
## can override using the `.groups` argument.
# Plot the clusters
twentyfive.percent.or.fewer.clusters.df %>%
filter(n > 1) %>%
ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
geom_line(aes(group = SNP)) +
geom_line(data = filter(twentyfive.percent.or.fewer.clusters.mean, n > 1), aes(x = Tissue, y = AF, group = 1, col = (cluster)), size = 1) +
geom_ribbon(data = tissue.mean, aes(x=Tissue, ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
facet_wrap(~cluster_size) +
scale_fill_manual(values=c("#424ef5", "#cf1919")) +
xlab("Tissue") +
ylab("Allele Frequency") +
labs(fill="Haplotype", col = "Cluster") +
theme_bw(20) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(strip.text.x = element_text(size = 12))
Most of these aren’t very clear cut clusters. I took the 8 that looked the most promising.
promising.subclonal.clusters = c(1, 13, 2, 23, 25, 5, 6, 8)
twentyfive.percent.or.fewer.clusters.df %>%
filter(cluster %in% promising.subclonal.clusters) %>%
ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
geom_line(aes(group = SNP)) +
geom_line(data = filter(twentyfive.percent.or.fewer.clusters.mean, cluster %in% promising.subclonal.clusters),
aes(x = factor(Tissue, levels = tissue_order), y = AF, group = 1, col = (cluster)), size = 1) +
geom_ribbon(data = tissue.mean, aes(x=factor(Tissue, levels = tissue_order), ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
facet_wrap(~cluster_size) +
scale_fill_manual(values=c("#424ef5", "#cf1919")) +
xlab("Tissue") +
ylab("Allele Frequency") +
labs(fill="Haplotype", col = "Cluster") +
theme_bw(20) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(strip.text.x = element_text(size = 12))
We were able to cluster a reasonable number of the high-frequency mutations. How many mutations haven’t been haplotyped yet?
# Still labeled as 'subclonal'
mutations.with.no.info = updated.haplotype.df %>%
filter(Background == 'subclonal') %>%
pull(SNP) %>%
unique()
print(paste("There are about", length(mutations.with.no.info), "that are still totally un-haplotyped."))
## [1] "There are about 372 that are still totally un-haplotyped."
# Never above 5% - nearly impossible to haplotype
mutations.never.above.5perc = updated.haplotype.df %>%
filter(Background == 'subclonal') %>%
group_by(SNP) %>%
summarise(Max = max(AF)) %>%
filter(Max < 0.05) %>%
pull(SNP) %>%
unique()
# Mutations that have some info
mutations.with.info = updated.haplotype.df %>%
filter(Background != 'subclonal') %>%
pull(SNP) %>%
unique()
print(paste("There are", length(mutations.with.info), "that have been clustered."))
## [1] "There are 159 that have been clustered."
# Mutations that are left, but are above 5% some of the time.
mutations.left.to.haplotype = mutations.with.no.info[which(!mutations.with.no.info %in% mutations.never.above.5perc)]
print(paste("There are still about", length(mutations.left.to.haplotype), "that can reasonably be haplotyped."))
## [1] "There are still about 163 that can reasonably be haplotyped."
# What's the distribution of the frequency of these mutations?
updated.haplotype.df %>%
filter(SNP %in% mutations.left.to.haplotype) %>%
ggplot(aes(x = AF)) +
geom_histogram(bins = 30) +
theme_bw()
Here are the frequency of all current haplotypes in the 15 tissue samples.
haplotypes.label = updated.haplotype.df %>%
select(SNP, Haplotype, Background) %>%
distinct()
haplotype.order = c("cluster 1",
"cluster 2",
"cluster 3",
"cluster 4",
"cluster 5",
"cluster 6",
"cluster 7",
"cluster 8",
"cluster 9",
"cluster 10",
"cluster 11",
"cluster 12",
"cluster 13",
"cluster 14",
"cluster 15",
"cluster 16",
"cluster 17",
"cluster 18",
"cluster 19",
"genome-1-1")
# Expand the mutations to have a frequency for every tissue.
expanded.df = updated.haplotype.df %>%
select(SNP, Tissue, AF) %>%
pivot_wider(names_from = "Tissue", values_from = "AF", values_fill = 0) %>%
pivot_longer(cols = !SNP, names_to = "Tissue", values_to = "AF") %>%
left_join(., select(labeled.df, c("SNP", "Tissue", "DP")), by = c("SNP", "Tissue")) %>%
mutate(DP = if_else(is.na(DP), 0, DP)) %>%
left_join(., haplotypes.label, by = "SNP")
# Get the mean frequency of the major genomes
tissue.mean = updated.haplotype.df %>%
filter(Haplotype %in% c("genome-1", "genome-2")) %>%
group_by(Tissue, Haplotype) %>%
summarize(AF.mean = mean(AF, na.rm = TRUE),
SD = sd(AF, na.rm = TRUE),
N = n()) %>%
mutate(SE = SD / sqrt(N),
Lower.CI = qt(1 - (0.05 / 2), N - 1) * SE,
Upper.CI = qt(1 - (0.05 / 2), N - 1) * SE) %>%
rename("AF" = AF.mean, "Genotype" = Haplotype)
haplotype.mean = expanded.df %>%
group_by(Tissue, Haplotype) %>%
summarize(AF = mean(AF))
expanded.df %>%
filter(!Haplotype %in% c("fixed", "both", "subclonal", "genome-1", "genome-2")) %>%
ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
geom_line(aes(group = SNP)) +
geom_line(data = filter(haplotype.mean, !Haplotype %in% c("fixed", "both", "subclonal", "genome-1", "genome-2")),
aes(x = factor(Tissue, levels = tissue_order), y = AF, group = 1, col = (Haplotype)), size = 1) +
geom_ribbon(data = tissue.mean, aes(x=factor(Tissue, levels = tissue_order), ymin=AF-SD, ymax=AF+SD, group = Genotype, fill = Genotype), alpha=0.2, colour = NA) +
facet_wrap(~factor(Haplotype, levels = haplotype.order)) +
scale_fill_manual(values=c("#424ef5", "#cf1919")) +
xlab("Tissue") +
ylab("Allele Frequency") +
labs(fill="Genotype", col = "Cluster") +
theme_bw(20) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(strip.text.x = element_text(size = 12))
# ggsave("../../results/figures/all-current-clusters.png", width = 20, height = 12, dpi = 300)
write_csv(updated.haplotype.df, output.path)